On evaluation metrics for medical applications of artificial intelligence

Clinicians and software developers need to understand how proposed machine learning (ML) models could improve patient care. No single metric captures all the desirable properties of a model, which is why several metrics are typically reported to summarize a model’s performance. Unfortunately, these measures are not easily understandable by many clinicians. Moreover, comparison of models across studies in an objective manner is challenging, and no tool exists to compare models using the same performance metrics. This paper looks at previous ML studies done in gastroenterology, provides an explanation of what different metrics mean in the context of binary classification in the presented studies, and gives a thorough explanation of how different metrics should be interpreted. We also release an open source web-based tool that may be used to aid in calculating the most relevant metrics presented in this paper so that other researchers and clinicians may easily incorporate them into their research.

Improving healthcare applications and supporting decision-making for medical professionals using methods from Artificial Intelligence (AI), specifically Machine Learning (ML), is a rapidly developing field with numerous retrospective studies being published every week. We observe an increasing number of prospective studies involving large multi-center clinical trials testing ML systems' suitability for clinical use. The technical and methodological maturity of the different areas varies, radiology and dermatology being examples of the more advanced fields 1 . In addition to these two examples, we observe a recent surge of studies in the field of gastroenterology 2 . The use of ML in gastroenterology is expected to significantly improve detection and characterization of colon polyps and other precancerous lesions of the Gastrointestinal (GI) tract 3 . These potential advances are mainly expected from artificial neural networks, specifically deep learning-based methods 4 . Currently, the most common applications of ML in GI are image classification (binary and multi-class) 5,6 , detection 7 , and segmentation 8 . For the purpose of the present study, we focus on binary classification as an initial step of covering the common metrics in medical applications of AI. Safe and efficient adoption of ML tools in clinical gastroenterology requires a thorough understanding of the performance metrics of the resulting models and confirmation of their clinical utility 9 . Therefore, the present study focuses on the current development of binary classification models in gastroenterology due to its timeliness and relevance. However, our discussions, recommendations, and proposed tool are valid and useful in every clinical field adopting and employing ML-based systems for binary classification tasks.
Creating strong evidence for the usefulness of ML models in clinical settings is an involved process. In addition to the relevant epidemiological principles, it requires a thorough understanding of the properties of the model itself and its performance. However, despite increased interest in ML as a medical tool, understanding of how such models work and how to aptly evaluate them using different metrics is widely lacking. In this article, we use examples of metrics and evaluations drawn from a variety of peer-reviewed and published studies in gastroenterology to provide a guide explaining different binary classification evaluation metrics, including how to interpret them. Note that we do not discuss the quality of these studies, but merely use them to discuss how different metrics give different interpretations of the quality of an ML model.
The main contributions are: We present a detailed discussion on metrics commonly used for evaluating binary ML classifiers, examine existing research using binary ML in gastroenterology along with reported metrics, and we discuss the different metrics' interpretations, usefulness, and shortcomings. To this end, we recalculate the reported metrics and calculate additional ones to further analyze the performance of the presented methods. Additionally, we present a web-based open source tool intended to let researchers perform metrics calculations easily, both for their own and other reported results to allow for comparison. 1 SimulaMet, Oslo, Norway. 2  Regardless of which metric is used, this can only be as informative as the classifier's performance on the test data. Blinding data, i.e., withholding data from those performing the experiment, is an important tool in many research fields, such as medicine. In some experiment types, it is difficult to achieve blinding, but the analysis in a setting where the data has already been collected can almost always be blinded. Misconceptions regarding the objectivity of statistical analysis should not keep researchers from blinding the data 10 . For ML analyses, such as the ones described in the present work, this means that one should set aside representative data that can be used for testing after the training and tuning processes are finished.

OPEN
Metrics. Binary classification problems can be expressed in terms of a mixture model, the total data distribution modeled as where X represents data samples, p P/N denotes the positive/negative class distributions, and α the mixture parameter of the positive class, calculated as α = N P N P +N N , with N P/N the total number of positive/negative class data samples. Studies in which the classification threshold of model outputs are tuned using a class imbalanced dataset, should investigate how these perform on other class admixtures. This is an important step to assess whether bias towards either class has been introduced and to what extent. The relevant quantities for calculating the metrics for a binary classifier are the four entries in the confusion matrix which are introduced below. Some metrics have different interpretations in the context of evaluating multi-class classification methods. Although these discussions fall outside the scope of the present one, the underlying principles still apply in multi-class settings. Furthermore, some methods have their metrics calculated on a perfinding basis, meaning there can be multiple instances for one image, and hence more positive samples than total samples (e.g., images or videos) in a dataset.
True positive (TP). The true positive denotes the number of correctly classified positive samples. For example, the number of frames containing a polyp is correctly predicted as having a polyp.
True negative (TN). The true negative denotes the number of correctly classified negative samples. For example, the number of frames not containing a polyp is correctly predicted as not having a polyp.
False positive (FP). The false positive denotes the number of samples incorrectly classified as positive. For example, the number of frames not containing a polyp is incorrectly predicted as having a polyp.
False negative (FN). The false negatives denotes the number of samples incorrectly classified as negative. For example, the number of frames containing a polyp is incorrectly predicted as not having a polyp.
Accuracy (ACC). The accuracy is the ratio between the correctly classified samples and the total number of samples in the evaluation dataset. This metric is among the most commonly used in applications of ML in medicine, but is also known for being misleading in the case of different class proportions since simply assigning all samples to the prevalent class is an easy way of achieving high accuracy. The accuracy is bounded to [0, 1], where 1 represents predicting all positive and negative samples correctly, and 0 represents predicting none of the positive or negative samples correctly.
Recall (REC). The recall, also known as the sensitivity or True Positive Rate (TPR), denotes the rate of positive samples correctly classified, and is calculated as the ratio between correctly classified positive samples and all samples assigned to the positive class. The recall is bounded to [0, 1], where 1 represents perfectly predicting the positive class, and 0 represents incorrect prediction of all positive class samples. This metric is also regarded as being among the most important for medical studies, since it is desired to miss as few positive instances as possible, which translates to a high recall.
Specificity (SPEC). The specificity is the negative class version of the recall (sensitivity) and denotes the rate of negative samples correctly classified. It is calculated as the ratio between correctly classified negative samples and www.nature.com/scientificreports/ all samples classified as negative. The specificity is bounded to [0, 1], where 1 represents perfectly predicting the negative class, and 0 represents incorrect prediction of all negative class samples.
Precision (PREC). The precision denotes the proportion of the retrieved samples which are relevant and is calculated as the ratio between correctly classified samples and all samples assigned to that class. The precision is bounded to [0, 1], where 1 represents all samples in the class correctly predicted, and 0 represents no correct predictions in the class.
where C denotes "class", and can in binary classification be either positive (P) or negative (N). The positive case of precision is often referred to as the Positive Predictive Value (PPV) and the negative case is often referred to as the Negative Predictive Value (NPV). Specifically, the PPV is the ratio between correctly classified positive samples and all samples classified as positive, and equals the precision for the positive class.
The NPV is the ratio between correctly classified negative samples and all samples classified as negative, and equals the precision for the negative class.
F1 score (F1). The F1 score is the harmonic mean of precision and recall, meaning that it penalizes extreme values of either. This metric is not symmetric between the classes, i.e., it depends on which class is defined as positive and negative. For example, in the case of a large positive class and a classifier biased towards this majority, the F1 score, being proportional to TP, would be high. Redefining the class labels so that the negative class is the majority and the classifier is biased towards the negative class would result in a low F1 score, although neither the data nor the relative class distribution has changed. The F1-score is bounded to [0, 1], where 1 represents maximum precision and recall values and 0 represents zero precision and/or recall.

Matthews correlation coefficient (MCC).
Pearson's correlation coefficient 11 takes on a particularly simple form in the binary case. This special case has been coined the MCC 12 , and become popular in ML settings for its favorable properties in cases of imbalanced classes 13 . It is essentially a correlation coefficient between the true and predicted classes, and achieves a high value only if the classifier obtains good results in all the entries of the confusion matrix (Eq. 2). The MCC is bounded to [−1, 1] , where a value of 1 represents a perfect prediction, 0 random guessing and −1 total disagreement between prediction and observation.
Threat score (TS). The TS, also called the Critical Success Index (CSI), is the ratio between the number of correctly predicted positive samples against the sum of correctly predicted positive samples and all incorrect predictions. It takes into account both false alarms and missed events in a balanced way, and excludes only the correctly predicted negative samples. As such, this metric is well suited for detecting rare events, where the model evaluation should be sensitive to correct classification of rare positive events, and not overwhelmed by many correct identifications of negative class instances. The TS is bounded to [0, 1], where 1 represents no false predictions in either class, and 0 represents no correctly classified positive samples.
We do not consider the AUROC (Area under the Receiver Operating Characteristic Curve) or AUPRC (Area under the Precision Recall Curve) since these cannot be calculated without access to the model, or from the entries of the confusion matrix. Extensive research has been done on their usefulness, and we refer the interested reader to 14 .

Methods
In the following, we identify a subset of relevant studies for our analysis. Medical studies presenting ML applications often refer to them simply as "AI systems". While AI has certainly received an unprecedented amount of attention over the past years, and presenting systems using this term emphasizes their novelty, the term is imprecise. Hence, we refrain from using this generic term in the following and instead refer to the exact model architecture used.

Study selection.
The studies used for this work are chosen based on the following rational considerations.
Our starting point is a recent review of AI in gastroenterology 15 . The review contains 138 articles, from which we select five studies that represent existing work using ML in gastroenterology. The selection criteria are as follows (i) Report sufficient information (many studies report so few metrics that it is not possible to calculate other metrics) for reproducing the reported metrics and calculating metrics not reported. (ii) Represent different cases of interest for performance metrics discussions.
Among the papers that fulfilled the aforementioned criteria, we selected papers that illustrated specific pitfalls within binary classification evaluation. In addition, we select a recent study reporting results from a large clinical trial, which was not included in the aforementioned review. The process is visualized in Fig. 1. The following contains a brief description of the selected studies and reported metrics. Study 1. Hassan et al. 16 introduce an unspecified "AI system" called GI-Genius to detect polyps, trained and validated using 2684 videos from 840 patients collected from white-light endoscopy. All 840 patients are randomly split into two separate datasets, one for training and one for validation. The validation dataset contains 338 polyps from 105 patients, where 168 of the identified polyps are either adenomas or sessile serrated adenomas. The authors report a sensitivity of 99.7% as the main performance metric, which is calculated from 337 TPs out of the total 338 positive samples. From this, readers are likely to conclude that only one FN instance is identified in the validation set. No other metrics are reported, and while it is reported that each colonoscopy contains 50,000 frames, no further details are given on the exact number of frames per video.

Study 2.
Mossotto et al. 17 use several ML models to classify diseases commonly found in the GI tract, using endoscopic and histological data. The data consist of 287 patients, from which 178 cases are Crohn's Disease (CD), 80 cases are Ulcerative Colitis (UC), and 29 cases are Unclassified Inflammatory Bowel Disease (IBDU). Results are shown from unsupervised (clustering) and supervised learning. The latter is used to classify CD and UC patients. For this, the data is divided into a model construction set consisting of 210 patients ( CD = 143, UC = 67 ), a model validation set of 48 patients ( CD = 35, UC = 13 ), and an IBDU reclassification set containing 29 IBDU patients. The model is thus not trained on IBDU data, and the latter dataset is excluded from the present discussion. The model construction set is stratified into a discovery set used to tune the parameters for CD versus UC discovery, and one for training and testing. For the best performing supervised model, tested on the test set, an accuracy of 82.7% , a precision of 0.91, a recall of 0.83 and an F1 score of 0.87 are reported, see Table 2 in 17 . On the validation set, the reported numbers are an accuracy of 83.3% , a precision of 0.86, a recall of 0.83, and an F1 score of 0.84, see Table 3 in 17 . These reported results are also listed in Table 1 20 . As the presented method is able to differentiate between different polyps within the same image, there may be more true positives than images in the dataset. This is also the reason why the metrics are reported on a per-image basis. The reported results show that the method is highly effective, with a per-image-sensitivity of 94.38% and a per-image-specificity of 95.92% . As the metrics are reported separately for images containing polyps and those that do not, recalculating the metrics as presented provides an inaccurate representation of the model's actual performance. This is because there are either no true positives or no true negatives, depending on the metrics used.
Study 5. Sakai et al. 21 propose a CNN-based system to automatically detect gastric cancer in images from colonoscopies. The model is trained on a dataset of 172,555 images containing gastric cancer and 176,388 images of normal colon. For evaluation, the model is tested on 4653 cancer images and 4997 normal images, on which it achieves an accuracy of 87.6% , a sensitivity of 80.0% , a specificity of 94.8% (see Table 1

Results
In this section, we perform a recalculation of all reported and missing metrics in the selected studies. Based on this, we discuss the usefulness of different metrics and how to obtain a realistic and complete picture of the performance of a classifier. This is done by extracting reported numbers and metrics from each study and using these to calculate additional metrics, which gives additional perspectives on the possible evaluations and could lead to different conclusions. In some cases, assumptions must be made in order to calculate metrics or assess model performances under different conditions. All assumptions made in this study are detailed in the relevant discussions.
The following discussion is structured around each selected study, where we highlight specific pitfalls that may arise due to incomplete metrics presentations. This includes a discussion on how precision and recall can Table 1. The reported metrics of the selected studies. The STUDY column represents each of the five studies selected for metric recalculation. The SET column is the different metrics calculated for the same set of data. The REPORTED column is how the metrics were reported in the respective study. To refer to the tables in each respective paper, we use T to refer to the table number and R for the row number. The EVALUATION column is the method used to generate the metrics. The TOTAL column is the total number of samples used in the metrics calculations. The POS and NEG columns represent the total number of positive and negative samples, respectively. The remaining columns correspond the aforementioned metric acronyms described in the main text.

Pitfall 1: precision and recall.
To reproduce the results of Study 1 16 , it is necessary to make some assumptions. Primarily, no information is given regarding the total number of frames for all videos, but as an average of 50,000 frames per video is reported, we use this to calculate the total number. Further, we calculate two sets of metrics, see Table 2 under study 1. For the first row, we calculate TP = 337, TN = 16,730,662, FP = 169,000, and FN = 1 using the same per polyp detection evaluation as Study 1. In the second row, we assume that ten seconds around the polyp are either detected correctly or missed with a frame rate of 25 fps. This yields TP = 84,500, TN = 16,646,500, FP = 169,000, and FN = 250, which are used in our calculations. FP for both calculations are obtained based on the reported 1% FPs per video, i.e., 50,000 100 = 500 . These assumptions yield two sets of results for the evaluation, which, if considered jointly, give a more thorough understanding of the performance. In any case, the reported values are not sufficient for reproducibility without making assumptions.
Assuming the most optimistic case amounts to mixture components of 0.995 and 0.005 for the positive and negative classes, respectively, meaning extremely imbalanced classes. The authors report a recall of 99.7% , in which case we calculate a precision of 0.33. Clearly, the recall must be interpreted with care in cases of strongly imbalanced classes. The reason is that precision and recall are both proportional to TP, but have an inverse mutual relationship: High precision requires low FP, so a classifier maximizing precision will return only very strong positive predictions, which can result in positive events missed. On the other hand, high recall is achieved by assigning more instances to the positive class, to achieve a low FN. Whether to maximize recall or precision depends on the application: Is it most important to identify only relevant instances, or to make sure that all relevant instances are identified? Regardless of which is the case, this should be clearly stated, and both metrics should be reported. The balance between the two has to be based on the medical use case and associated requirements. For example, some false alarms are acceptable in cancer detection, since it is crucial to identify all positive cases. On the other hand, for the identification of less severe disease with high prevalence, it can be important to achieve the highest possible precision. A low precision combined with a high recall implies that the classifier is prone to set of false alarms (FPs), which can result in an overwhelming manual workload and time wasted. Pitfall 2: mixture parameter dependent tuning. In Study 2, Mossotto et al. 17 split the model construction dataset into two subsets of equal size and class distribution, with mixture components 0.68 and 0.32 for the two classes CD and UC. One of these subsets is used to tune parameters to maximize CD versus UC classification, meaning that the classification task is done with the underlying assumption that the class admixture will remain constant. This is trivially true for the training and test data, being the other of the two subsets, but not for the validation set, where the corresponding mixture components are 0.73 and 0.27. The authors do not mention the deviation from the tuned admixture, nor do they investigate systematically how much a given deviation affects the reported performance metrics. Without access to the resulting model, this cannot be investigated further in this article or by other interested readers. Consequently, there is no way of knowing the sensitivity the presented method has to the class admixture. www.nature.com/scientificreports/ The study does not report any of the confusion matrix entries, thus not facilitating the process of reproducible results. However, we can derive the TN by multiplying the total number of positive samples with the recall, TP = REC × (TP + FN) . The FP are then obtained via FP = TP PREC − TP . From this, we can calculate the reported and missing metrics. The authors report the positive class precision, which represents the PPV. In addition, the NPV should also have been reported for completeness. As shown in Table 2, the NPV is lower for the validation datasets where the precision is high, and vice versa. Calculating the MCC, this is stable around 0.63 over all validation sets listed in Table 3 of 17 , although not as high as any of the reported metrics.
Pitfall 3: blinded data. In Study 3, Byrne et al. 18 remove data samples for which the classifier does not achieve high confidence, as well as videos with more than one polyp. They calculate the model's performance for the different videos in the study, and remove the ones on which the model performs poorly. As such, the results reported from the study concerns a very specific selection of their data, made after the model has been adjusted. Excluding data on which a model performs poorly leads to a misrepresentation of its abilities and should not be done. If the classification task is too difficult or the removed data was faulty, this should instead be reported, and a new classifier should be trained for a more limited task. The metrics reported from a study should be calculated after the final model calibration and subsequent testing on blinded data.  Table 2 in 19 , or metric set 1 under study 4 in Table 1. For the first two of these, sufficient numbers are reported to reproduce the reported metrics, as well as to calculate the corresponding positive class precisions, which are reported as 0.96 and 0.97, respectively. In the third case, the positive class precision cannot be calculated since the positive and negative samples are separated for the test. No explanation or reason is given regarding why the tests are performed only on the separated classes and not together, which would give a better overall impression of the performance. For the first dataset, it is not clear how the numbers are calculated, as no test set is mentioned. This could mean that the reported sensitivity is calculated on the training data.
Rows five to seven in Table 2 show the results achieved when combining the negative and positive class samples. Since we do not know if the obtained model is biased towards the negative or positive class, we present three evaluations: In row five, we assume that the positive and negative results can simply be combined, which gives an overall MCC of 0.84 and NPV of 0.98, indicating good performance. In rows six and seven, we assume that the model is biased towards the positive or negative class, respectively. The resulting MCCs are both −0.05 and the NPVs both 0. This means that the classifier, which seemed to perform exceptionally well based on the reported numbers, is actually severely under-performing on the negative class. Besides these ambiguities, the results for the first three datasets indicate strong performance, but using the same numbers to calculate metrics more sensitive to bias, reveals severe under-performance (see Tables 1 and 2 for all reported and calculated numbers).
While detection and classification are in principle the same task for a fixed number of instances per class, the study in 19 faces a challenge: The negative class is unbounded, i.e., the number of negative instances is undefined. The more sensitive the classifier is, the larger the negative class effectively becomes, as the classifier generates FP instances, and the negative class instances can be calculated as Neg = TN + FP . In general, evaluating without clearly defining boundaries for the classes is risky, as it can lead to an unclear impression of the model performance, in either the positive or negative direction. It is also nearly impossible for follow-up studies to reproduce and compare results.
Without a well-defined number of true negatives in a video (or set of images), each of the frames not containing a polyp and each of the pixels not being part of a polyp are in principle true negatives. Optimally, the classes should instead be balanced, at best with a mixture parameter of 0.5. If this is not possible, the study should at least be based on well-motivated assumptions informed by real-world properties. For example, a standard colonoscopy contains on average n number of frames and m polyps found per examination. Most colonoscopies take less than an hour, so assuming a 24 h time frame would be an unreasonable assumption within such boundaries.
Keeping the positive and negative classes separated in an evaluation can lead to misleading results and can make a model appear very different in terms of performance, depending on the presentation. The most important question that one should ask before performing the evaluation is: Which evaluation and metrics provide the most accurate representation regarding how the model will perform in the real world? This needs to be an overarching picture including both classes and a set of diverse and well-suited metrics.
Pitfall 5: class dependent performance. Study 5 21 contains a confusion matrix, enabling us to calculate most metrics for the reported results. As shown in Table 1, reported accuracy is 0.88, the specificity 0.95, the sensitivity 0.80, and the PPV 0.93. The PPV indicates the precision for the positive class and should be accompanied with the corresponding metric for the negative class, i.e., the NPV, which we calculate to be 0.84. This could indicate that the model is better at classifying positive than negative samples correctly, which would be surprising, given that the model is trained on slightly more negative than positive class images. However, not knowing the loss function or which measure the model was optimized for, we cannot investigate this further. What we do know is that a large number of FNs directly cause the low NPV: On the test dataset, the model has a significantly higher number of FNs, meaning missed detection, than FPs, meaning a false alarm and in this case overdetection of cancer. The study specifically states reducing misdetection to be the primary motivation for using ML assisted diagnosis, and should thus have reported metrics providing a more comprehensive representation of the model's performance in this regard. For instance, the MCC, which measures the correlation between the true and predicted classes, and is high only if the prediction is good on the positive and the negative class. By using the reported results, we calculate the MCC to be 0.76, which is still an acceptable performance, although not as www.nature.com/scientificreports/ high as the metrics reported by 21 . Which metric values are acceptable depends on non-technical aspects, e.g., the human performance baseline or requirements from hospitals or health authorities. A potential weakness associated with the NPV is its dependence on TNs, which can overwhelm a classifier whose purpose is detecting rare events. In such cases, the TS, which does not take TNs into account, can be advantageous. From the values reported in Table 1 of study 21 , the TS value is 0.76, again indicating that the model performs sub-optimally with respect to the objective, despite achieving high accuracy and precision values. In conclusion, the reported metrics show that the model, for the most part, performs well on the evaluation dataset. When taking the recalculated metrics into account, we see that the model is more prone to misdetection than causing false alarms.
MediMetrics. Together with this study, we release a web-based tool called MediMetrics for calculating the metrics shown in Table 2, to make them easily accessible for medical doctors and ML researchers alike. From the provided input, the tool automatically calculates all possible metrics and generates useful visualizations and comparisons the user may freely use in their research. The tool is open-source, and the code is available on GitHub (www. github. com/ simula/ medim etrics). In the future, we plan to provide a repository of publicly available studies that have been evaluated using our tool. Moreover, we plan to expand the tool to support other methodologies as well, such as multi-class classification, detection, and segmentation methods.

Discussion
There are many metrics that can be used to evaluate binary classification models. Using only a subset could give a false impression of a model's actual performance, and in turn, yield unexpected results when deployed to a clinical setting. It is therefore important to use a combination of multiple metrics and interpret the performance holistically. Calculating multiple metrics does not require extra work in terms of study design or time, thus there is no apparent reason not to include a set of metrics, besides lack of space, obfuscating actual performance, or lack of knowledge regarding classifier evaluation. Besides interpreting the different metrics together, metrics for the separate classes should be calculated individually. Special care should be taken in cases of imbalanced classes, and the robustness of the classifier's performance tested over a range of class admixtures. In general, a high score in any metric should be regarded with suspicion.
Training and evaluation sets should be strictly separated: Optimally, the data should be split into training, validation, and test datasets. The test dataset should be separate from the other partitions to avoid introducing bias on the parameters set during the tuning phase. Furthermore, data regarding the same instance should not be shared across data splits. For example, frames of the same polyp from different angels should not be shared across the training and test datasets. Once the model's performance has been optimized on the training data, including tests on a validation set, it can be finally evaluated using the test set. This last step should thus not involve additional tuning, and the test data should not be made available to the analysis before results are fixated for publication. We argue strongly that this should be the standard for studies on the performance of ML classifiers used in medicine in the future. If possible, cross-dataset testing should be performed, meaning in this context that the training and test data are obtained from different hospitals or at least different patients.
In general, all studies involving classification should report the obtained TP, FP, TN, and FN values for validation and test data. In addition, the data along with either the source code, the final models or both should be made available. If this is not possible, other alternatives, like performing additional evaluation on public datasets, such as Kvasir 22 or the Sun database 23 , should be considered. If such an alternative is chosen, it is important to check if the test data is outside the distribution of the training data 24 , and in that case, re-fit the model's parameters. Although public datasets do not match the purpose of the study, evaluating the model on such data, by either re-training it or applying it directly on the data if similar to the initial data used for training, would allow others to compare methods and results.